GatewayNode

Web scraping with Rust

created:	2021-01-15T02:26:39Z
modified:	2021-01-17T03:39:00Z

So recently I needed to pull some data out of a web page and thought to do it with Rust. I noticed some of the docs and tutorials are a little out of date so I thought I’d share what worked for me.

To start off with I’ll be using two crates, reqwest for easy HTTP requests and select for parsing out what we need from the webpage DOM. So my initial Cargo.toml looks like this.

Cargo.toml

[dependencies]
reqwest = { version = "0.11", features = ["blocking", "json"] }
select = "0.5.0"

You might notice I’m specifically enabling the "blocking" feature of the reqwest crate. The crate’s default client has been asynchronous for a while now, but for scraping we have a few things we need to be careful about. First among those is our volume of requests: async makes it very easy to fire off many requests at once, which is exactly what we don’t want. It’s fairly easy to trigger all sorts of denial of service protections by making too many requests too quickly, so we are instead going to use the blocking client and make one request at a time. We wouldn’t want our IP address on any blacklists for misbehaving scrapers, so let’s just go low and slow.
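To make "low and slow" a bit more concrete for the case where there is more than one page to fetch, here is a minimal sketch of pacing blocking calls with a pause in between. The helper name fetch_slowly and its parameters are my own invention for illustration and are not part of reqwest. The fetch closure is a stand-in for whatever blocking request you make.

```rust
use std::thread;
use std::time::Duration;

/// Calls `fetch` once per URL, strictly one at a time, sleeping for `delay`
/// between consecutive calls (there is no sleep before the first call or
/// after the last one).
///
/// `fetch` is a hypothetical stand-in for a blocking request.
fn fetch_slowly<T, F>(urls: &[&str], delay: Duration, mut fetch: F) -> Vec<T>
where
    F: FnMut(&str) -> T,
{
    let mut results = Vec::with_capacity(urls.len());
    for (i, &url) in urls.iter().enumerate() {
        if i > 0 {
            // Give the server a breather before the next request.
            thread::sleep(delay);
        }
        results.push(fetch(url));
    }
    results
}
```

With reqwest this could be called as fetch_slowly(&urls, Duration::from_secs(2), |u| reqwest::blocking::get(u)), where the two second pause is an arbitrary choice and not a recommendation from any site.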

main.rs

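// Since the 2018 edition, crates listed in Cargo.toml are already in scope,
// so we only need to import the specific items we use.
use select::document::Document;
use select::predicate::Name;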


// Main should be where we compose other components so we start off simple here
fn main() {
	let payload = get_things("https://news.ycombinator.com");
	println!("{:#?}", payload);
}

// Our get things function is where we do most of our work
fn get_things(url: &str) -> Vec<String> {
	let mut found: Vec<String> = Vec::new();
	// Here’s the request call in blocking mode
	let resp = match reqwest::blocking::get(url) {
		Err(why) => panic!("Response unwrap: {}", why),
		Ok(value) => value,
	};

	// Now that we have our response we can pass it to select using Document
	match Document::from_read(resp) {
		Err(why) => panic!("Read document with select issue: {}", why),
		Ok(value) => {
			value
				.find(Name("a"))
				.filter_map(|n| n.attr("href"))
				.for_each(|x| {
					if x.starts_with("http") {
						// This will skip anything that is not an external link
						found.push(x.to_string());
					}
				});
		}
	}
	found
}

So I didn’t take any unwrap() or ? shortcuts in the error handling, even though I did use panic!(). I’m finding that matching explicitly helps me identify where an error is happening and makes it easier to handle properly in the future.

The page request with shortcuts would have looked like this.

let resp = reqwest::blocking::get(url).unwrap();
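By contrast, keeping the match around leaves room for more than a panic. As a rough sketch of one option, here is a small retry helper that tries a fallible operation a few times before giving up. The name retry and its parameters are hypothetical, not something reqwest or select provides.

```rust
use std::thread;
use std::time::Duration;

/// Runs `attempt` until it succeeds or has been tried `max_tries` times
/// (always at least once), sleeping for `pause` after each failed try.
/// Returns the first Ok value, or the error from the last try.
fn retry<T, E, F>(max_tries: u32, pause: Duration, mut attempt: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut tries = 0;
    loop {
        tries += 1;
        match attempt() {
            Ok(value) => return Ok(value),
            // Out of tries: hand the last error back to the caller.
            Err(why) if tries >= max_tries => return Err(why),
            // Otherwise wait a little and go around again.
            Err(_) => thread::sleep(pause),
        }
    }
}
```

In get_things the request could then become retry(3, Duration::from_secs(5), || reqwest::blocking::get(url)), still followed by a match that panics only once all three tries have failed. The three tries and five second pause are just example numbers.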

To break down exactly what is happening with select’s Document struct:

We instantiate a Document from a readable object with the ::from_read(resp) call. This is wrapped in a match statement instead of a trailing unwrap() so we have finer-grained control over what happens. I use a panic here, but there is little reason we couldn’t try again three times before calling a final panic, or do something else entirely.
find(Name("a")) returns an iterator over every "a" tag in the DOM.
Then filter_map() keeps just the "href" attribute values, dropping any "a" tags that don’t have one.
Finally we loop through those hrefs and push just the ones starting with "http" (the external links) into our Vec<String>. This could probably be another filter, but this works.
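For what it’s worth, the "another filter" version of that last step might look something like the sketch below. To keep it standalone it takes an iterator of optional hrefs, not a Document, and the function name external_links is made up for this example.

```rust
/// Keeps only the hrefs that are present and start with "http",
/// returning them as owned Strings.
fn external_links<'a, I>(hrefs: I) -> Vec<String>
where
    I: Iterator<Item = Option<&'a str>>,
{
    hrefs
        // Drop the None entries (anchors without an href attribute).
        .flatten()
        // Keep absolute links only; relative ones are skipped.
        .filter(|href| href.starts_with("http"))
        .map(str::to_string)
        // Build the Vec directly, with no mutable Vec and push.
        .collect()
}
```

With select, the input could be value.find(Name("a")).map(|n| n.attr("href")), since attr() returns an Option<&str>. Collecting straight into the Vec also means get_things would no longer need its mutable found variable.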

NOTE: Not to pick on Hacker News here, it’s just a very simple DOM, with very little JavaScript, and it’s heavily cached. Just a good site that makes a good example.

...this works, but isn’t what I’m building. More to follow.

Author: Gatewaynode